# Data importing
# Import files from google drive: https://drive.google.com/file/d/1LpV_Kas9YU_OJcaz5_Yy5YhowI6sV6y_/view?usp=sharing
# Import Images from google drive if needed: https://drive.google.com/drive/folders/1os4vBvRymakpskWkiFk1B-cASCMevNTU?usp=sharing
# GSE21374 = getGEO("GSE21374")[[1]]
# GSE36059 = getGEO("GSE36059")[[1]]
# GSE48581 = getGEO("GSE48581")[[1]]
# GSE46474 = getGEO("GSE46474")[[1]]
# GSE129166 = getGEO("GSE129166")[[1]]
# saveRDS(GSE21374, "data/GSE21374.rds")
# saveRDS(GSE48581, "data/GSE48581.rds")
# saveRDS(GSE36059, "data/GSE36059.rds")
# saveRDS(GSE46474, "data/GSE46474.rds")
# saveRDS(GSE129166, "data/GSE129166.rds")
# CPOP Model Building Datasets
GSE21374 = readRDS("data/GSE21374.rds")
GSE36059 <- readRDS("data/GSE36059.rds")
GSE48581 <- readRDS("data/GSE48581.rds")
# Auxilary Datasets
GSE46474 <- readRDS("data/GSE46474.rds") # Builds Biological Sex knn
user_input <- readRDS("data/GSE129166.rds") # External dataset used to generate final model, full GSE used here rather than an expression csv as was seen in the Shiny app to decrease pipeline complexity
Over 11,000 kidney transplants have occurred in Australia and New Zealand over the last decade, but long-term graft survival rates are still not favourable (Doucet et al., 2021). The product created by our team aimed to help clinical researchers investigating this area of study through three methods. Firstly, we provided clinical researchers with a pipeline to add public datasets to their in-house data in order to increase the number of kidney biopsy gene expression observations available. This pipeline accounted for batch-effects through a pairwise ratio normalisation approach (Wang et al., 2020). Due to the lack of clinical variables such as sex in public datasets, we also aimed to build a model to extract biological sex from biopsy expression data. Finally, we aimed to provide the researcher with stable features across multiple CPOP models built between their in-house dataset and public datasets (figure 1).
We found that our biological sex model was able to predict the sex on a testing dataset with 100% accuracy, implying strong applicability in extracting sex from kidney gene expression data. We were also able to normalise all datasets effectively to produce a homogenous set of combined data, which when experimented on showed fairness across both sex and data subsets.
Thus, this pipeline should be useful for researchers who need more data to build models for kidney graft survival outcomes, as well as stable genetic features to research further. In the future, this platform can be expanded to support peripheral blood expression data, as well as extracting more clinical variables from expression data such as immunological age.
Between 2000 and 2015, there were nearly 11,000 kidney transplants in Australia and New Zealand (Doucet et al., 2021). Waitlists for a transplant can be up to seven years due to the scarcity of viable organs, and the growing demand for the elderly (Doucet et al., 2021). As such, maximising the likelihood of a successful kidney transplant is crucial. Long-term graft survival rates have shown little improvement over the last decade, and as a result is at the forefront of the work conducted by renal clinical researchers (Knoll, 2008).
The last decade has seen a large emphasis placed on reconciling gene-expression data with transplant outcomes in order to understand outcomes from a genetic and physiological perspective. Krajewska et al. (2019) found that the expression of certain genes involved in immune regulation may have an influence on long-term graft survival. However, understanding the interaction between genes and graft survival requires understanding genetic pathways, and not just genes in isolation (Damman et al., 2015). This signifies the importance of understanding the interaction between genes, or relative expression levels, as opposed to individual gene expression levels itself.
With genomics data, researchers often encounter the issue of having numerous genes and associated expression data, but small sample sizes. From a genetic research perspective, a better understanding of genes and genetic pathways can maximise the chance of stable outcomes for patients, but from a model building perspective, more samples are required to adequately test interesting features for effectiveness in predicting graft survival outcomes.
While there is publicly available kidney biopsy gene expression data from previous studies, combining it with a researcher’s in-house dataset is not straightforward due to the batch-effects associated with collected expression data (Johnson et al., 2007). If proper normalisation processes are not followed, any features selected by a researcher’s model will not be transferable to other biopsy expression datasets. Further, many existing public datasets lack clinical variables such as biological sex which have shown to impact renal health (Si et al., 2009).
Hence, the aims of our product are the following: Firstly, to aid clinical researchers in their own model building, we combine their in-house data with existing public data after first normalising by taking pairwise log-ratios of genes. In doing so, we aim to create a new, homogenous set of expression data that should be fair across all sub-datasets. We also aim to extract biological sex from historical kidney biopsy expression data using common sex biomarkers and append this to the combined expression dataset. Lastly, we aim to provide researchers with a set of potential features that could be worth researching by considering stable features across multiple Cross Platform Omics Prediction (CPOP) models (Wang et al., 2020). These features – ratios between genes – may point towards specific genetic pathways that could help researchers explain differing graft survival outcomes in transplant recipients.
We produced an analysis pipeline which ingests a clinical researcher’s gene expression data from a kidney biopsy study (figure 1). Before any normalisation is done, we first convert all probe identification names from the Affymetrix platform into gene names, as well as cleaning the graft survival column to a binary stable or reject outcome.
The three public datasets used were GSE21374, GSE36059 and GSE48581. We chose these datasets as they were all kidney biopsy datasets, and all had graft survival outcome columns along with over 200 observations. As these were all expression datasets from the Affymetrics platform, the probe ids were standardised, making it straightforward to convert them to gene names.
Figure 1
According to Si et al. (2009), specific genes such as XIST and EIF1AY are expressed preferentially compared to other genes in biopsy analyses. Therefore, our approach to build the biological sex model was as follows. We took the aforementioned genes, along with one baseline gene – ANKRD44 - and computed the pairwise log ratios of the expression values for each of those genes. Because EIF1AY is found on the male sex chromosome, its relative expression should be higher than the other two genes for each male observation, and lower for each female observation (Si et al., 2009). We were then able to build a kNN model (k = 3) where the features were the pairwise ratios of the expression of the three chosen genes (figure 1).
Due to limited datasets with known biological sexes of participants, we had to test our model on kidney peripheral blood expression data. While we were initially apprehensive about this, Jansen et al. (2014) suggests that the same sex biomarkers XIST and EIF1AY are differentially expressed in blood expression as well. Hence, we were able to evaluate our model by running a repeated CV on GSE46474 – a blood dataset – and still expect our model to translate to a biopsy expression dataset in our analysis pipeline (figure 6).
Due to the issue of batch effects across datasets, we first needed to normalise the public datasets along with the in-house dataset. Firstly, we filtered the datasets by only including the 200 most variable genes as this would significantly reduce computation time compared to the 50,000 genes that had expression data to begin with. We then took the log of all the expression values and then found the pairwise ratios of each of the 200 genes (Wang et al., 2020) (figure 1). We were able to evaluate this normalization process visually, by producing boxplots of the expression ranges for each observation in each dataset.
Once the datasets were shown to be homogenous, we were able to combine the public and in-house data together, as well as adding biological sex labels where missing. To test, we used GSE129166 as example in-house data, however any Affymetrix based expression set is compatible with our pipeline.
An important consideration here was ensuring that models built on the combined dataset by researchers would perform fairly across sex, and across subsets of the data. To evaluate this, our methodology involved building a kNN model (k = 10) using the combined dataset, and then comparing the accuracy and positive predictive parity for each subgroup, when grouping by biological sex or by the original dataset (figure 7). We chose the following evaluation strategies to ensure that it is possible to build effective graft survival predictive models on the larger dataset without any bias in outcomes towards a specific sex or initial dataset. Otherwise, this would significantly lessen the usability of the combined dataset for a researcher as it limits model transferability.
To achieve the aim of providing a set of features that could be worth further research, we ran parallel CPOP models using the in-house dataset and each of the three public datasets after normalisation. Each model provided a set of features, i.e. ratios of selected genes. We chose the intersection of features across all three models. This is a similar approach to an ensemble classifier system, except we are using multiple models to generate stable features across datasets, as opposed to predicting outcomes themselves (Rokach, 2010). The justification behind this methodology is that even though the intersection of features may be small, it allows the researcher to investigate very specific pathways, and eliminate noise that is often associated with large scale expression datasets. This may result in more precise relationships between the physiology of specific genetic pathways and graft outcomes when they use these features in their own model building procedure. The feature map produced can be seen in the appendix (figure 8). We evaluate this procedure by doing a repeated cross-validation of the individual CPOP models used to generate features as this will indicate the performance of the models we are using in order to justify feature selection to the clinical researcher (figure 9).
# Transforms Affymetrics probe ID into Gene Name
gene_names = function(gse) {
fData(gse)$`Gene Symbol` = unlist(lapply(strsplit(fData(gse)$`Gene Symbol`, " /// ", 1), `[`, 1))
idx = which(!duplicated(fData(gse)$`Gene Symbol`) & !is.na(fData(gse)$`Gene Symbol`))
gse = gse[idx,]
rownames(gse) = fData(gse)$`Gene Symbol`
return(gse)
}
GSE21374 = gene_names(GSE21374)
GSE36059 = gene_names(GSE36059)
GSE48581 = gene_names(GSE48581)
GSE46474 = gene_names(GSE46474)
user_input = gene_names(user_input)
# Convert reject and stable labels to binary outcome vector (0,1)
p_GSE21374 = pData(GSE21374)
p_GSE48581 = pData(GSE48581)
p_GSE36059 = pData(GSE36059)
p_ue = pData(user_input)
p_ue = p_ue %>% filter(characteristics_ch1 == "tissue: kidney allograft biopsy") # GSE129166 has a mix of blood and biopsy samples so we only extract the biopsy samples. In the shiny app, we assume that the input data is biopsy only
ue_biopsy_samp = rownames(p_ue)
p_GSE36059$diagnosis = ifelse(p_GSE36059$characteristics_ch1 == "diagnosis: non-rejecting", 0, 1)
p_GSE48581$diagnosis = ifelse(p_GSE48581$characteristics_ch1.1 == "diagnosis (tcmr, abmr, mixed, non-rejecting, nephrectomy): non-rejecting", 0, 1)
p_GSE21374$diagnosis = ifelse(p_GSE21374$characteristics_ch1.3 == "rejection/non rejection: nonrej", 0, 1)
userOutcomes = ifelse((p_ue$characteristics_ch1.1 == "tcmr (no: 0_borderline:1_TCMR:2): 0") & (p_ue$characteristics_ch1.2 == "abmr (no: 0_Yes:1): 0"), 0, 1)
y1 = as.factor(p_GSE36059$diagnosis)
y2 = as.factor(p_GSE48581$diagnosis)
y3 = as.factor(p_GSE21374$diagnosis)
userOutcomes = as.factor(userOutcomes)
# Transforms and removes genes with low variance
exp_GSE36059 = (exprs(GSE36059)) # Import Expression data
Variance = rowVars(as.matrix(exp_GSE36059)) # Extract variance
Variance = as.data.frame(Variance)
exp_GSE36059 = as.data.frame(exp_GSE36059)
exp_GSE36059 = cbind(exp_GSE36059, variance = Variance) # Add variance column to expression data
exp_GSE36059 = slice_max(exp_GSE36059, order_by = Variance, n = 300) # Orders expression data from most to least variable then takes the top 300 values
exp_GSE36059 = subset(exp_GSE36059, select = -c(Variance)) # removes the variance column
row_names_exp_GSE36059 = rownames(exp_GSE36059) # Takes the gene names for later
exp_GSE21374 = (exprs(GSE21374))
Variance = rowVars(as.matrix(exp_GSE21374))
Variance = as.data.frame(Variance)
exp_GSE21374 = as.data.frame(exp_GSE21374)
exp_GSE21374 = cbind(exp_GSE21374, variance = Variance)
exp_GSE21374 = slice_max(exp_GSE21374, order_by = Variance, n = 300)
exp_GSE21374 = subset(exp_GSE21374, select = -c(Variance))
row_names_exp_GSE21374 = rownames(exp_GSE21374)
exp_GSE48581 = (exprs(GSE48581))
Variance = rowVars(as.matrix(exp_GSE48581))
Variance = as.data.frame(Variance)
exp_GSE48581 = as.data.frame(exp_GSE48581)
exp_GSE48581 = cbind(exp_GSE48581, variance = Variance)
exp_GSE48581 = slice_max(exp_GSE48581, order_by = Variance, n = 300)
exp_GSE48581 = subset(exp_GSE48581, select = -c(Variance))
row_names_exp_GSE48581 = rownames(exp_GSE48581)
userExpression = exprs(user_input)
Variance = rowVars(as.matrix(userExpression))
Variance = as.data.frame(Variance)
userExpression = as.data.frame(userExpression) %>% select(ue_biopsy_samp) # Only chooses biopsy samples
userExpression = cbind(userExpression, variance = Variance)
userExpression = slice_max(userExpression, order_by = Variance, n = 300)
userExpression = subset(userExpression, select = -c(Variance))
row_names_UE = rownames(userExpression)
# Takes the common genes from the datasets
intersection = intersect(row_names_exp_GSE36059, row_names_exp_GSE21374)
intersection = intersect(intersection, row_names_exp_GSE48581)
intersection = intersect(intersection, row_names_UE)
exp_GSE36059 = as.data.frame(t(as.matrix(exp_GSE36059)))
exp_GSE36059 = subset(exp_GSE36059, select = c(intersection)) # select rows that contain the probe ids in intersection
exp_GSE21374 = as.data.frame(t(as.matrix(exp_GSE21374)))
exp_GSE21374 = subset(exp_GSE21374, select = c(intersection))
exp_GSE48581 = as.data.frame(t(as.matrix(exp_GSE48581)))
exp_GSE48581 = subset(exp_GSE48581, select = c(intersection))
userExpression = as.data.frame(t(as.matrix(userExpression)))
userExpression = subset(userExpression, select = c(intersection))
z1 = exp_GSE36059 %>% as.matrix()
z2 = exp_GSE48581 %>% as.matrix()
z3 = exp_GSE21374 %>% as.matrix()
userExpression_z = userExpression %>% as.matrix()
# log transforms the data
exp_GSE36059_log <- exp_GSE36059
exp_GSE36059_log = exp_GSE36059_log + 1
exp_GSE36059_log = log(exp_GSE36059_log)
exp_GSE21374_log <- exp_GSE21374
exp_GSE21374_log = exp_GSE21374_log + 1
exp_GSE21374_log = log(exp_GSE21374_log)
exp_GSE48581_log <- exp_GSE48581
exp_GSE48581_log = exp_GSE48581_log + 1
exp_GSE48581_log = log(exp_GSE48581_log)
ue_log = userExpression
ue_log = ue_log + 1
ue_log = log(ue_log)
z1_log = exp_GSE36059_log %>% as.matrix()
z3_log = exp_GSE21374_log %>% as.matrix()
z2_log = exp_GSE48581_log %>% as.matrix()
ue_log = ue_log %>% as.matrix()
# Perform Pairwise difference calculation for later use
z1_pairwise = data.frame(pairwise_col_diff(z1_log) %>% as.matrix())
z2_pairwise = data.frame(pairwise_col_diff(z2_log) %>% as.matrix())
z3_pairwise = data.frame(pairwise_col_diff(z3_log) %>% as.matrix())
# Set up X and Y for CV
X1 = z1_log
X2 = ue_log
y_1 = y1
y_2 = userOutcomes
# Collate the results, observations, and outcome labels into 1 dataframe
observation_df = data.frame(sample = cv_obsv, results = as.factor(cv_res), outcome = as.factor(outcome))
rownames(observation_df) = observation_df$sample
observation_df = observation_df %>% select(results, outcome)
# Calculate stats
acc = confusionMatrix(observation_df$results, observation_df$outcome)$overall["Accuracy"]
b_acc = confusionMatrix(observation_df$results, observation_df$outcome)$byClass["Balanced Accuracy"]
cpop1_stats = data.frame(name = "CPOP1", accuracy = acc, balanced_accuracy = b_acc)
X1 = z2_log
X2 = ue_log
y_1 = y2
y_2 = userOutcomes
# Collate the results, observations, and outcome labels into 1 dataframe
observation_df = data.frame(sample = cv_obsv, results = as.factor(cv_res), outcome = as.factor(outcome))
rownames(observation_df) = observation_df$sample
observation_df = observation_df %>% select(results, outcome)
# Calculate Stats
acc = confusionMatrix(observation_df$results, observation_df$outcome)$overall["Accuracy"]
b_acc = confusionMatrix(observation_df$results, observation_df$outcome)$byClass["Balanced Accuracy"]
cpop2_stats = data.frame(name = "CPOP2", accuracy = acc, balanced_accuracy = b_acc)
X1 = z3_log
X2 = ue_log
y_1 = y3
y_2 = userOutcomes
# Collate the results, observations, and outcome labels into 1 dataframe
observation_df = data.frame(sample = cv_obsv, results = as.factor(cv_res), outcome = as.factor(outcome))
rownames(observation_df) = observation_df$sample
observation_df = observation_df %>% select(results, outcome)
# Calculate Stats
acc = confusionMatrix(observation_df$results, observation_df$outcome)$overall["Accuracy"]
b_acc = confusionMatrix(observation_df$results, observation_df$outcome)$byClass["Balanced Accuracy"]
cpop3_stats = data.frame(name = "CPOP3", accuracy = acc, balanced_accuracy = b_acc)
# Merge the 3 CPOP model features together
rownames(cpop1) = cpop1$coef_name
rownames(cpop2) = cpop2$coef_name
rownames(cpop3) = cpop3$coef_name
results = merge(merge(cpop1, cpop2, all=FALSE, by = "coef_name"), cpop3, all=FALSE, by = "coef_name")
# Calculate the absolute magnitude of the difference between the 2 coefficients
row.names(results) = results$coef_name
results = results %>% select(-c(avg.x, avg.y, avg, coef_name))
results = abs(results)
results$average = rowMeans(results)
# Quick function to simplify the pairwise difference taking. Was use predominantly to see what the different transformations did
pairwise = function(exp_GSE, transform_type) {
z = exp_GSE
if (transform_type == "Arc") {
z = z / max(z)
z = asin(sqrt(z))
z = pairwise_col_diff(z) %>% as.matrix()
}
else if (transform_type == "Log") {
z <- pairwise_col_diff(log(z +1))
colnames(z) <- sub(".*\\.","",colnames(z))
z = z %>% as.matrix()
}
else if (transform_type == "Pair"){
z = z %>% as.matrix()
z_pairwise = pairwise_col_diff(z) %>% as.matrix()
}
z = data.frame(z)
return(z)
}
# Generate visnetwork
coef <- rownames(results)
names = data.frame(feature1 = rep("",length(coef)),
feature2 = rep("",length(coef)),
coef_size = abs(results$average))
names$feature1 = sub("--.*", "", coef) #getting the first node from the coef-name vector of the results df
names$feature2 = sub(".*--", "", coef) #getting the second node from the coef-name vector of the results df
names_uniq = unique(c(names$feature1, names$feature2))
# names_uniq
numbers = names
#numbers
for (a in 1:nrow(numbers)) {
for (i in 1:length(names_uniq)) {
if (numbers$feature2[a] == names_uniq[i]) {
numbers$feature2[a] = i
}
}
}
for (a in 1:nrow(numbers)) {
for (i in 1:length(names_uniq)) {
if (numbers$feature1[a] == names_uniq[i]) {
numbers$feature1[a] = i
}
}
}
#numbers
clr = c()
for (a in 1:length(names_uniq)) {
for (i in 1:nrow(names)) {
if (names_uniq[a] == names[i,1] | names_uniq[a] == names[i,2]) {
clr[a] = names$color[i]
}
}
}
#clr
cleanNames <- sub("--.*","",names_uniq)
edges = data.frame(from = numbers$feature1, to = numbers$feature2, value = names$coef_size)
nodes = data.frame(id = c(1:length(names_uniq)),
label = names_uniq,
# color = clr,
title = paste0('<a target="_blank" href = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=',cleanNames,'">',cleanNames,'</a>'))
#nodes
Figure8 = visNetwork(nodes, edges, height = "500px", width = "100%", main = "Figure 8 Features Intersect Between CPOP Models")
# Refresh the data for use in CV
ue <- as.data.frame(t(as.data.frame(exprs(user_input)) %>% select(ue_biopsy_samp)))
temp_GSE36059 <- as.data.frame(t(exprs(GSE36059)))
temp_GSE48581 <- as.data.frame(t(exprs(GSE48581)))
temp_GSE21374 <- as.data.frame(t(exprs(GSE21374)))
# Below function uses a KNN to predict biological sex for the incoming dataset
calculate_sex = function(dataframe) {
colnames(dataframe) = tolower(colnames(dataframe))
if ("xist" %in% colnames(dataframe) && "eif1ay" %in% colnames(dataframe) && "rps4y1" %in% colnames(dataframe)) {
# Extracts the right genes from the input dataset
outcome_df = dataframe %>% select("xist", "eif1ay", "rps4y1")
outcome_df = pairwise(outcome_df, transform_type = "Log") %>% as.matrix()
outcome_df = data.frame(outcome_df)
# Performs cleaning on the training dataset GSE46474
p_GSE46474 = pData(GSE46474)
p_GSE46474$outcome = ifelse(p_GSE46474$characteristics_ch1.1 == "Sex: M", 0, 1) # Male is 0 and female is 1
exprs_GSE46474 = data.frame(t(exprs(GSE46474)))
colnames(exprs_GSE46474) = tolower(colnames(exprs_GSE46474))
exprs_GSE46474 = exprs_GSE46474 %>% select("xist", "eif1ay", "rps4y1")
exprs_GSE46474 = pairwise(exprs_GSE46474, transform_type="Log")
# Model generation. CV results are in the appendix
model = knn(exprs_GSE46474, outcome_df, p_GSE46474$outcome, k = 3)
return(model)
}
return(NULL)
}
# Below code adds kidney graft outcome, biological sex info from KNNm and which dataset the observations are from
ue$outcome <- userOutcomes
ue$biological_sex <- calculate_sex(as.data.frame(userExpression))
ue$source <- c(rep("userUploadedData", length(rownames(userExpression))))
temp_GSE36059$outcome <- y1
temp_GSE36059$biological_sex <- calculate_sex(data.frame(t(exprs(GSE36059))))
temp_GSE36059$source <- c(rep("GSE36059", length(rownames(temp_GSE36059))))
temp_GSE48581$outcome <- y2
temp_GSE48581$biological_sex <- calculate_sex(data.frame(t(exprs(GSE48581))))
temp_GSE48581$source <- c(rep("GSE48581", length(rownames(temp_GSE48581))))
temp_GSE21374$outcome <- y3
temp_GSE21374$biological_sex <- calculate_sex(data.frame(t(exprs(GSE21374))))
temp_GSE21374$source <- c(rep("GSE21374", length(rownames(temp_GSE21374))))
# Combine the 4 datasets
combinedDataset <- bind_rows(temp_GSE36059,temp_GSE48581, temp_GSE21374,ue )
outcome_source = combinedDataset %>% select(c(outcome, source, biological_sex)) # Temporarily remove non-gene cols in order to do variance analysis on each gene
combinedDataset = t(combinedDataset %>% select(-c(outcome, source, biological_sex)))
# Calculate variance of each gene
vars = rowVars(as.matrix(combinedDataset))
vars = as.data.frame(vars)
# Take top 100 most variable genes
combinedDataset = slice_max(cbind(combinedDataset, variance = vars), order_by = vars, n = 100)
combinedDataset = as.data.frame(t(subset(combinedDataset, select = -c(vars))))
combinedDatasetRaw = combinedDataset # For later boxplot
# Log transform the combined dataset
combinedDataset = combinedDataset + 1
combinedDataset = log(combinedDataset)
combinedDataset = cbind(combinedDataset, outcome_source)
combinedDatasetRaw$source = combinedDataset$source
X = combinedDataset %>% select(-c(outcome, source, biological_sex))
X = pairwise_col_diff(X)
y = combinedDataset$outcome
# Collate results and observations together
observation_df = data.frame(sample = cv_obsv, results = as.factor(cv_res))
rownames(observation_df) = observation_df$sample
observation_df = observation_df %>% select(results)
performance_df = merge(combinedDataset, observation_df, by=0, all=TRUE) %>% select(Row.names, outcome, results, source, biological_sex)
# Calculate statistics for the combined dataset
performance_df$correct = ifelse(performance_df$results == performance_df$outcome, 1, 0)
# Split by source
gse21374 = performance_df %>% filter(source == "GSE21374")
gse36059 = performance_df %>% filter(source == "GSE36059")
gse48581 = performance_df %>% filter(source == "GSE48581")
user = performance_df %>% filter(source == "userUploadedData")
data_acc = c(count(gse21374$correct == 1)/nrow(gse21374), count(gse36059$correct == 1)/nrow(gse36059), count(gse48581$correct == 1)/nrow(gse48581), count(user$correct == 1)/nrow(user))
data_pos = c(count(gse21374$results == 1)/nrow(gse21374), count(gse36059$results == 1)/nrow(gse36059), count(gse48581$results == 1)/nrow(gse48581), count(user$results == 1)/nrow(user))
data_obsv = c(nrow(gse21374), nrow(gse36059), nrow(gse48581), nrow(user))
# Split by sex
male = performance_df %>% filter(biological_sex == 0)
female = performance_df %>% filter(biological_sex == 1)
gen_acc = c(count(male$correct == 1)/nrow(male), count(female$correct == 1)/nrow(female))
gen_pos = c(count(male$results == 1)/nrow(male), count(female$results == 1)/nrow(female))
gen_obsv = c(nrow(male), nrow(female))
source_stats = data.frame(`Dataset` = c("GSE21374", "GSE36059", "GSE48581", "User Input"), Observations = data_obsv, Accuracy = round(data_acc, 2), `Positive Prediction Rate` = round(data_pos, 2))
sex_stats = data.frame(`Biological Sex` = c("Male", "Female"), Observations = gen_obsv, Accuracy = round(gen_acc, 2), `Positive Prediction Rate` = round(gen_pos, 2))
total_perf = caret::confusionMatrix(as.factor(performance_df$results), as.factor(performance_df$outcome))
box_1 = ggplot(data = combinedDatasetRaw, aes(x = rownames(combinedDatasetRaw), y = mean)) +
geom_point(aes(color = source), size = 0.1) +
geom_errorbar(aes(ymin = q1,
ymax = q3,
color = source), size = 0.15, alpha = 0.55) +
theme(axis.ticks = element_blank()) +
theme(axis.text.x = element_blank()) +
xlab("Samples") + ylab("Expression Level (Log)")+
theme(axis.title.y=element_blank()) +
labs(title = "Figure 3 Raw Data (Top 100 Most Variable Genes)") +
theme(plot.title = element_text(size=10), legend.position = "bottom") + scale_color_viridis(discrete = TRUE, option = "inferno", alpha = 1, begin = 0, end = 0.8)
combined_pairwise = combinedDataset %>% select(-c(outcome, source, biological_sex))
combined_pairwise = 1+combined_pairwise
combined_pairwise = log(combined_pairwise)
combined_pairwise = as.data.frame(pairwise_col_diff(combined_pairwise %>% as.matrix()))
combined_pairwise$source = combinedDataset$source
combined_pairwise$outcome = combinedDataset$outcome
combined_pairwise$mean = rowMeans(combined_pairwise %>% select(-c(outcome, source)))
combined_pairwise$q1 = rowQuantiles(combined_pairwise %>% select(-c(outcome, source, mean)) %>% as.matrix(), probs=c(0.25))
combined_pairwise$q3 = rowQuantiles(combined_pairwise %>% select(-c(outcome, source, mean, q1)) %>% as.matrix(), probs=c(0.75))
box_2 = ggplot(data = combined_pairwise, aes(x = rownames(combined_pairwise), y = mean)) +
geom_point(aes(color = source), size = 0.1) +
geom_errorbar(aes(ymin = q1,
ymax = q3,
color = source), size = 0.15, alpha = 0.55) +
theme(axis.ticks = element_blank()) +
theme(axis.text.x = element_blank()) +
xlab("Samples") + ylab("Expression Level (Log)")+
labs(caption = "This figure shows a comparison of expression data pre- and post-normalisation. \n The pre-transformation plot shows the range of expression values for each gene across an observation. \n The post-transformation plot shows the range of values for each pairwise ratio of expression values \n for each observation, indicating stability across all datasets.
") +
theme(axis.title.y=element_blank()) +
labs(title = "Log Transformed + Pairwise Difference") +
theme(plot.title = element_text(size=10), legend.position="none") + scale_color_viridis(discrete = TRUE, option = "inferno", alpha = 1, begin = 0, end = 0.8)
We evaluated the biological sex model by running 50-repeated 5-fold cross validation on the sex outcomes of the GSE46474 dataset. This had a 100% predictive accuracy, which is reassuring considering relative expression for Y-chromosomal genes should be higher in males as females have differing genes at these loci (Jansen et al., 2014). This gives us confidence that the genes chosen to compute relative expression and then build the kNN model were appropriate features (figure 2).
p_GSE46474 = pData(GSE46474)
p_GSE46474$outcome = ifelse(p_GSE46474$characteristics_ch1.1 == "Sex: M", 0, 1) # Male is 0 and female is 1
exprs_GSE46474 = data.frame(t(exprs(GSE46474)))
exprs_GSE46474 = exprs_GSE46474 %>% select(c("XIST", "EIF1AY", "ANKRD44"))
exprs_GSE46474 = pairwise(exprs_GSE46474, transform_type="Log") %>% as.matrix()
exprs_GSE46474 = data.frame(exprs_GSE46474)
X1 = exprs_GSE46474
y1 = p_GSE46474$outcome
cvK = 5 # Number of CV folds
cv_acc_knn = cv_50acc5_knn = c()
for (i in 1:50) {
cvSets = cvTools::cvFolds(nrow(X1), cvK)
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X1[test_id, ]
X_train = X1[-test_id, ]
y_test = y1[test_id]
y_train = y1[-test_id]
#KNN
fit <- knn(X_train, X_test, y_train, k = 3)
cv_acc_knn[j] = mean(fit == y_test)
}
cv_50acc5_knn <- append(cv_50acc5_knn, cv_acc_knn)
}
boxplot(cv_50acc5_knn,ylim = c(0,1), ylab = "Proportion Correct", main = "Figure 2 Repeated 5 Fold CV for the Biological Sex KNN Model")
mtext("This boxplot shows the mean accuracy across 5-fold CV \n for the biological sex model across 50 repeats", side=1,line=3)
After applying our normalisation procedure as outlined in the methods section, we can see a clear visual difference in the expression levels (figure 3). Before any transformation, all four datasets – three public datasets, and an input dataset uploaded by the researcher – have differing expression level scales. This was expected, because even though they are all datasets measuring the expression of genes from a kidney biopsy, batch effects are still problematic. However, our procedure of taking the pairwise ratios of the genes after applying a log transformation show that all four datasets have been brought to the same scale, with all expression data being centred around zero with very similar ranges of expression across all datasets. For the purpose of our analysis, we chose the GSE129166 biopsy expression dataset from the NCBI’s Gene Expression Omnibus to test as the input dataset. The fact that we were able to achieve similar expression values after normalisation across all datasets indicates a high degree of transferability of this process, provided that the input data used by a researcher is also from biopsy data.
grid.arrange(box_1, box_2, ncol = 1)
While expression values have been normalised, in order to fully justify the efficacy of providing a researcher with this combined dataset that includes their in-house data, public expression data, and biological sex labels, we must consider the fairness performance.
Fairness is obviously model dependant, and while the scope of our product isn’t building the researcher a predictive model explicitly, we decided to build an example model, as we intend a researcher to do themselves, and then evaluate how this model performs across sex, and across the subsets of the combined data itself. This is important because we want to ensure that there is not any bias in performance towards a specific dataset or sex. If this bias in performance were to exist, it would indicate that combining the data in this way may not be possible.
Using a kNN (k=10) built on the combined dataset, we see that when stratified by the original dataset, the prediction accuracy of graft survival outcomes is fairly consistent across all datasets at roughly 75% (figure 4). This is reassuring as it suggests that when we built a model on the combined dataset, it was able to predict with similar success regardless of the original dataset a sample came from. This reassures our aim that from a researcher’s perspective, the new dataset we provide them is homogenous with their in-house data.
The positive predictive rate is also similar across datasets, except for GSE48581, however the accuracy being on par with other datasets alleviates this concern (figure 3). This is reassuring as it indicates that graft rejection outcomes aren’t being preferentially selected depending on the dataset an observation was from.
We see similar results when evaluating fairness in terms of biological sex (figure 3). Both accuracy and positive predictive parity are consistent across both biological males and biological females and around 70% and 20% respectively, indicating that when using the combined dataset, it is possible to build a predictive model for graft survival outcomes without biasing a specific sex. This is incredibly useful for a clinical researcher as it then allows them to utilise sex-specific information in their analysis and own model building, which is known to impact kidney health and function, but without needing to worry about model unfairness (Si et al., 2009).
# Figure 4
knitr::kable(list(source_stats, sex_stats), format="markdown", caption = "Figure 4 Fairness Table")
|
|
Accuracy and positive predictive parity, stratified by dataset and by biological sex. Predictions were obtained using a kNN model built on the combined dataset after normalisation is applied.
The cross-validation of the three CPOP models used for stable feature selection showed accuracies of approximately 72%, and balanced accuracies of approximately 62% (figure 5). These are relatively well performing, considering the purpose of this tool isn’t to predict graft survival outcomes, but to give researchers features that could be useful to research. This supports the use of the three separate models, and affirms the use of an ensemble-style feature selection method to deliver to researchers specific features that are stable across multiple models, and may thus eliminate noise when trying to further study the relationship between gene pathways and kidney graft outcomes.
rownames(cpop1_stats) = NULL
rownames(cpop2_stats) = NULL
rownames(cpop3_stats) = NULL
stats = rbind(cpop1_stats, cpop2_stats, cpop3_stats)
ggplot(data = melted_stats, aes(fill = variable, x = name, y = value)) + geom_bar(position="dodge", stat="identity") + ylab("Model") + ggtitle("Figure 5 CPOP Model Accuracy and Balanced Accuracy") + geom_text(aes(label=value), position=position_dodge(width=0.9), vjust=-0.5) + ylim(0,1) + scale_fill_viridis(discrete = T, option = "E") + labs(caption = "This figure shows both the accuracy and balanced accuracy for each of the CPOP models built. \n The CPOP models built were as follows: (1) inhouse -- GSE36059, (2) inhouse -- GSE48581, and (3) inhouse -- GSE21374.
")
Our results when evaluating our feature selection process, biological sex model, and data normalisation/combination justify our final deployment of our Shiny pipeline. All three components of our product can function fairly and independently, allowing researchers to upload in-house data that is able to be normalised, appended with biological sex information, and combined to form a larger dataset along with available public data from the Affymetrix platform (figure 1). Further, the strong performance of the ensemble of feature selection CPOP models reiterates that the stable features generated for the researcher can contribute to the future direction of their clinical research in graft survival outcomes. Our Shiny app was deployed locally as a web application, which in future, should be deployed on a website for clinicians.
From a data science and genetics perspective, during the development of the Shiny app, we experienced the issue of poor integration between blood and biopsy data. When we tried to normalise these different datasets, we struggled to achieve homogeneity, and so had to make our platform for researchers strictly biopsy based. An initial concern was the fact that we could only test our biological sex model on blood data, with hopes of ultimately applying it to biopsy data on our platform. However, the literature supported the same biomarkers for sex being present across kidney blood and biopsy datasets, but this approach may still have some limitations (Si et al., 2009, Jansen et al., 2014).
From the perspective of deploying the application, we also aimed to improve the speed of running CPOP models on large expression datasets by employing parallel processing techniques which reduced CPOP run time from five minutes to forty-seconds, something we are sure clinical researchers would appreciate. Currently, the application can only accept a file that has NAs as its error values, but in future, the application should be able to clean a file with various types of errors.
From a genetics perspective, we decided to only accept gene data files that came from the Affymetrix gene platform as the probe ids of the data files would match. After matching the probe ids, we converted the ids to gene names by selecting the first gene matched to the probe id. Some probe ids correspond to multiple genes, however we chose to only take the first gene as this is generally the most prominent, and allows for 1:1 correspondence between probe id and gene.
Another limitation with the product was not allowing the user to specify the normalisation method required before combining the public datasets with the in-house dataset. While Wang et al. (2020) strongly support the pairwise ratio normalisation approach, some researchers may have preferred to have a final combined dataset with only genes, as opposed to pairwise-ratios of genes as the columns. This may limit the applicability of our product in the eyes of some clinical researchers.
In future, the application can be further developed by adding the feature of extracting immunological age from the expression datasets. Immunological age may provide extra clinical information for a researcher to model with, that doesn’t rely on knowing the chronological age of participants (Alpert et al., 2019) . The application of this could be similar to our biological sex model, and could provide researchers with an opportunity to filter for specific age-cohorts when building their own predictive models (West et al., 2006).
Overall, we were able to locally deploy a Shiny app that allowed researchers to upload in-house kidney biopsy expression data, combine it with existing public data to increase observation size, and receive a set of stable features across datasets that may be worth investigating in their own research. We were also able to extract biological sex information and append this to the combined dataset for each observation. Although we achieved our aim, there is still room for improvement going forward, such as better integration or usage of existing kidney blood datasets, more platform flexibility aside from Affymetrix, and extracting other clinical variables beyond sex, such as immunological age.
Figure 6
Figure 7
Figure8
VisNetwork of stable features selected across the three CPOP models. Each gene is a node, and the feature is represented by an edge. Two nodes connected by an edge indicates that the pairwise ratio in expression of those two genes may contribute to kidney graft outcomes. The size of the edge represented the size of the coefficient as per the CPOP model. Clicking on a node will take the researcher to a website with information about that specific gene and related pathways.
Figure 9
One of my main contributions to the project was adopting the role as overall project leader, which consisted of delegating weekly tasks and making decisions on the types of models to use. I coded the data normalisation by consulting existing literature on CPOP methods, as well as doing all of the preliminary CPOP code. I also decided on what fairness metrics and evaluation strategies to use, and researched how to build the biological sex model. I then helped to implement these models in Shiny. I played an active role for both the presentation and report; creating the slide deck for the presentation, finalising Figure 1 and finalising each of the sections of the report.
My contribution was prototyping the wireframe and coding the shiny app. In the early phases of the project, I built the wireframe using draw io, asked for feedback from the tutors/team members and built a demo of the Shiny app. During week 12, I worked with Mukund and Liam to implement their CPOP code into the Shiny app, where I implemented the following features: 4 loading bars, 1 upload csv file button, 1 pairwise genes table, 1 visnetwork, 2 download buttons, 1 toggle button (to show the 3 visnetworks), 1 boxplot and a select button. I optimised the performance of the Shiny app by including parallelism via the furrr package, where I ran 3 CPOP models in parallel, and implemented caching via loading saved rds objects. I have a total of 46 commits on the github project.
The initial contribution I made to the project consisted of coding data cleaning functions to clean the public datasets, such as cleaning the graft survival column to a binary stable or reject outcome. I also coded a kNN model which was included in our Shiny app, and wrote functions in order to extract gender from gene expression data. All of these functions and models were then implemented into the Shiny app. Additionally, I created and implemented the fairness metrics in the Shiny app. Furthermore, I created the R Markdown for the report, putting together all of my team’s written sections, as well as the code and figures.
My contributions involved designing the shiny app, implementing functions for the app, reiterating figure 1, and drafted the slide deck. I contributed to coding and formatting mostly the shiny UI script and partly the server script to include text descriptions, figures, downloadable files, external links, and outputs, and displayed these in a way that is clear and easily understood by the user. I constructed figure 1 and continued iterations of it after feedback.
My early contribution to the project consisted of coding an SVM model and basic risk classifier, which were ultimately not included in the final Shiny in order to make it more targeted towards clinical researchers. My later contributions consisted of writing text for the Shiny and planning the report for my group by creating a document with details such as the content for each section, so that everyone could simply flesh out these dot points into a comprehensive report, as well as reading through and editing the final report.
ALPERT, A., PICKMAN, Y., LEIPOLD, M., ROSENBERG-HASSON, Y., JI, X., GAUJOUX, R., RABANI, H., STAROSVETSKY, E., KVELER, K., SCHAFFERT, S., FURMAN, D., CASPI, O., ROSENSCHEIN, U., KHATRI, P., DEKKER, C. L., MAECKER, H. T., DAVIS, M. M. & SHEN-ORR, S. S. 2019. A clinically meaningful metric of immune age derived from high-dimensional longitudinal monitoring. Nature Medicine, 25, 487-495.
DAMMAN, J., BLOKS, V. W., DAHA, M. R., VAN DER MOST, P. J., SANJABI, B., VAN DER VLIES, P., SNIEDER, H., PLOEG, R. J., KRIKKE, C., LEUVENINK, H. G. D. & SEELEN, M. A. 2015. Hypoxia and Complement-and-Coagulation Pathways in the Deceased Organ Donor as the Major Target for Intervention to Improve Renal Allograft Outcome. Transplantation, 99.
DOUCET, B. P., CHO, Y., CAMPBELL, S. B., JOHNSON, D. W., HAWLEY, C. M., TEIXEIRA-PINTO, A. R. M. & ISBEL, N. M. 2021. Kidney Transplant Outcomes in elderly Recipients: An Australia and New Zealand Dialysis and Transplant (ANZDATA) Registry Study. Transplantation Proceedings, 53, 1915-1926.
JANSEN, R., BATISTA, S., BROOKS, A. I., TISCHFIELD, J. A., WILLEMSEN, G., VAN GROOTHEEST, G., HOTTENGA, J.-J., MILANESCHI, Y., MBAREK, H., MADAR, V., PEYROT, W., VINK, J. M., VERWEIJ, C. L., DE GEUS, E. J. C., SMIT, J. H., WRIGHT, F. A., SULLIVAN, P. F., BOOMSMA, D. I. & PENNINX, B. W. J. H. 2014. Sex differences in the human peripheral blood transcriptome. BMC genomics, 15, 33-33.
JOHNSON, W. E., LI, C. & RABINOVIC, A. 2007. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118-127.
KNOLL, G. 2008. Trends in Kidney Transplantation over the Past Decade. Drugs, 68, 3-10.
KRAJEWSKA, M., KOŚCIELSKA-KASPRZAK, K., KAMIŃSKA, D., ŻABIŃSKA, M., MYSZKA-KOZŁOWSKA, M., GOMUŁKIEWICZ, A., DZIĘGIEL, P. & KLINGER, M. 2019. Kidney Transplant Outcome Is Associated with Regulatory T Cell Population and Gene Expression Early after Transplantation. Journal of Immunology Research, 2019, 7452019.
ROKACH, L. 2010. Ensemble-based classifiers. Artificial Intelligence Review, 33, 1-39.
SI, H., BANGA, R. S., KAPITSINOU, P., RAMAIAH, M., LAWRENCE, J., KAMBHAMPATI, G., GRUENWALD, A., BOTTINGER, E., GLICKLICH, D. & TELLIS, V. 2009. Human and murine kidneys show gender-and species-specific gene expression differences in response to injury. PloS one, 4, e4802.
WANG, K. Y., PUPO, G. M., TEMBE, V., PATRICK, E., STRBENAC, D., SCHRAMM, S.-J., THOMPSON, J. F., SCOLYER, R. A., MUELLER, S. & TARR, G. 2020. Cross-Platform Omics Prediction procedure: a game changer for implementing precision medicine in patients with stage-III melanoma. bioRxiv.
WEST, M., GINSBURG, G. S., HUANG, A. T. & NEVINS, J. R. 2006. Embracing the complexity of genomic data for personalized medicine. Genome research, 16, 559-566.